This is an R Markdown document that contains instructions and code for the examples used in today’s workshop. The first few steps will check that you have all the packages and data files necessary to carry out all of the analyses.
Sara’s section …
The following code chunk assumes that Brain Atlas files have been downloaded and placed in the “BrainAtlas” subdirectory of a folder in your home directory entitled “FestivalWorkshopSC”. If you have downloaded these files to another location, either create a new folder and move the files there, or substitute the file path to where they are currently located for “~/FestivalWorkshopSC/BrainAtlas”. If you are using the RStudio Server instance provided by the workshop, change the first line that follows to setwd("/home/FestivalWorkshopSC/BrainAtlas")
setwd("~/FestivalWorkshopSC/BrainAtlas")
file.exists("cell_metadata.csv")
## [1] TRUE
file.exists("genes_counts.csv")
## [1] TRUE
file.exists("ercc_counts.csv")
## [1] TRUE
file.exists("README.txt")
## [1] TRUE
If any of the preceding lines return FALSE, double check that you have set the correct working directory and that all download files have been placed in that folder. If they are missing, you can download the files here. Note that there will be more files than listed here (including alternate gene count quantifications, and metadata about data-driven cluster memberships as discussed in the paper “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics”), but these are the ones we will make use of.
The read.csv function in R is useful for reading in .csv (comma-separated value) files. First, we’ll read in the main data file genes_counts.csv to a data frame using this function and check its contents. The extra arguments to this function help to format our object so that we have row and column names, and we use character variables instead of converting to factors. For more details on these arguments, you can type help(read.csv). In the resulting coutns object, we have genes in rows (24057) and cells in columns (1679).
counts <- read.csv("genes_counts.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(counts[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame': 24057 obs. of 20 variables:
## $ A01101401: num 0 992 2.57 0 0 0 0 0 0 1 ...
## $ A01101402: num 0 2287 177 0 0 ...
## $ A01101403: num 0 492 0 0 0 ...
## $ A01101404: num 0 1932 1 0 0 ...
## $ A01101405: num 0 1425 2 0 0 ...
## $ A01101406: num 0 130 3 0 0 ...
## $ A01101407: num 0 2110 3041 0 17 ...
## $ A01101408: num 0 955 101 0 0 ...
## $ A02271433: num 0 326 0 0 0 349 0 0 0 0 ...
## $ A02271434: num 0 933 1042 0 0 ...
## $ A02271435: num 0 296 0 0 0 ...
## $ A02271436: num 0 31 200 0 0 984 0 0 0 36 ...
## $ A02271437: num 0 970 0 0 3 ...
## $ A02271438: num 0 594 355 0 228 ...
## $ A02271439: num 0 2290 3294 0 0 ...
## $ A02271440: num 0 29 1 0 0 ...
## $ A12101401: num 0 373 540 0 0 ...
## $ A12101402: num 0 462 197 0 0 0 0 0 0 0 ...
## $ A12101403: num 0 516 96 0 498 ...
## $ A12101404: num 0 1019 0 0 0 ...
The cell_metadata.csv file contains 1679 rows (one for each cell) and columns containing information such as collection date, sequencing type, total reads, mapping percentage, dissection layer, and major/minor derived cell subtypes.
cells <- read.csv("cell_metadata.csv", stringsAsFactors = FALSE, header = TRUE)
str(cells)
## 'data.frame': 1679 obs. of 16 variables:
## $ long_name : chr "A01101401" "A01101402" "A01101403" "A01101404" ...
## $ cre : chr "Calb2" "Calb2" "Calb2" "Calb2" ...
## $ collection_date : chr "11/18/2013" "11/18/2013" "11/18/2013" "11/18/2013" ...
## $ sequencing_type : chr "hiseq" "hiseq" "hiseq" "hiseq" ...
## $ total_reads : int 23770190 9694719 5864322 22102121 24057147 24171169 22447919 20995719 9023705 23016406 ...
## $ all_mapped_percent: num 93.5 92.9 90.5 93.2 93.1 ...
## $ mRNA_percent : num 54.4 45.7 48.3 51.4 51.1 ...
## $ genome_percent : num 30.1 35.5 34 33.8 32.1 ...
## $ ercc_percent : num 4.36 7.84 4.12 4.24 4.98 3.14 3.12 2.27 3.22 2.13 ...
## $ tdt_permillion : num 306 341 106 371 264 ...
## $ major_class : chr "Inhibitory" "Inhibitory" "Excitatory" "Inhibitory" ...
## $ sub_class : chr "Vip" "Vip" "L4" "Vip" ...
## $ major_dissection : chr "V1" "V1" "V1" "V1" ...
## $ layer_dissectoin : chr "All" "All" "All" "All" ...
## $ color_code : int 11 11 11 11 11 11 11 11 11 11 ...
## $ short_name : chr "A200_V" "A201_V" "A202_V" "A203_V" ...
The ‘ercc_counts.csv’ file contains 1679 columns (one for each cell) and 93 rows containing the counts for the ERCC spike-in control RNA transcripts (and spike in tdTomato). The ERCC spike-ins are a standard set of RNA transcripts that are spiked in at known concentrations to each cell (more on this later). We’ll remove the tdTomato transcript since this does not serve the same purpose as the ERCC spike-ins.
ercc <- read.csv("ercc_counts.csv", stringsAsFactors = FALSE, header = TRUE, row.names=1)
str(ercc[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame': 93 obs. of 20 variables:
## $ A01101401: int 167620 16658 35838 18394 0 0 0 0 0 0 ...
## $ A01101402: int 131187 10413 31563 20370 0 0 0 0 0 578 ...
## $ A01101403: int 37523 2248 9723 1342 0 0 0 0 0 185 ...
## $ A01101404: int 136565 4789 34892 8611 0 0 1354 0 1 0 ...
## $ A01101405: int 222462 15067 53617 19096 0 0 0 0 0 0 ...
## $ A01101406: int 125448 7055 31013 4439 0 0 0 0 0 0 ...
## $ A01101407: int 109946 9494 25467 4122 0 0 1 0 0 0 ...
## $ A01101408: int 79354 5755 17085 9585 0 0 0 0 0 296 ...
## $ A02271433: int 29555 5180 26581 3004 0 0 0 0 0 304 ...
## $ A02271434: int 58598 4188 29983 3939 0 0 0 0 0 233 ...
## $ A02271435: int 18494 836 8201 359 0 0 0 0 0 0 ...
## $ A02271436: int 10725 764 5968 619 0 0 0 0 0 0 ...
## $ A02271437: int 48311 1914 18605 1635 0 0 0 0 0 634 ...
## $ A02271438: int 31437 1279 12595 2102 0 0 0 0 0 332 ...
## $ A02271439: int 58475 5016 25756 11074 0 0 0 0 0 0 ...
## $ A02271440: int 127465 6838 67261 2358 0 0 0 0 0 0 ...
## $ A12101401: int 48418 2363 8437 3996 0 0 0 0 0 119 ...
## $ A12101402: int 77486 3819 17759 4358 0 521 0 0 0 0 ...
## $ A12101403: int 72009 4899 13997 4956 0 0 0 0 0 0 ...
## $ A12101404: int 62729 2947 16044 1551 0 0 0 0 0 0 ...
# remove the tdTomato row
whichTomato <- grep("tdTomato", rownames(ercc))
ercc <- ercc[-whichTomato,]
More detailed information about the files downloaded from the Allen Brain Atlas can be found in the README.txt file provided. Here is a peek at the contents of that file.
# This is a bash command, to be executed at the command line (not within R);
# Alternatively, simply open the README.txt in your favorite text editor to view its contents
head README.txt
## Description of files contained in this data download:
## 1. genes_counts.csv: File containing read count values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 2. genes_rpkm.csv: File containing the RPKM values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 3. ercc_counts.csv: File containing read count values obtained from the RSEM algorithm for each external ERCC spike-in control (row) for each cell (column)
## 4. cell_metadata.csv: File containing information about each cell profiled, including its nomenclature, Cre line of origin, dissection, date of collection and sequencing, and read mapping statistics
## 5. cluster_metadata.csv: File containing information about each data-driven cluster, including its label, the corresponding label in Tasic et al. (Nat. Neuro, 2106), the primary cell class membership, and marker genes (including genes with widespread expression in the cluster, sparse expression in the cluster, and no expression in the cluster).
## 6. cell_classification.csv: File containing information about the cluster membership of each cell, including whether the cell is a "core" (unambiguously assigned to a single cluster) or "transition" (shares membership between two clusters) cell, as well as its membership score (from 0-10) for each cluster (labeled f01 to f49).
##
## To generate the count and RPKM data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013.
## Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585
Once a list of desired R packages is finalized, can check that they are installed with
require(scde) #bioconductor
require(monocle) #bioconductor
require(scran) #bioconductor
require(scater) #bioconductor
require(Biobase) #bioconductor
require(scDD) #github
require(ggplot2) #cran
require(devtools) #cran
require(RColorBrewer) #cran
If any of these commands return a message that includes “there is no package called…”, then the package is missing and needs to be installed. Note that packages may be stored in one of several package repositories. The most popular are Bioconductor, github, and CRAN. For Bioconductor packages, for example edgeR, this can be done with the following code:
source("http://bioconductor.org/biocLite.R")
biocLite("monocle")
For CRAN packages, for example devtools, installation can be done with the following code:
install.packages(devtools)
For Github packages, for example scDD, installation can be done with the following code:
install.packages("devtools")
devtools::install_github("kdkorthauer/scDD")
# extract top 1000 variable genes
gene.var <- apply(counts, 1, function(x) var(log(x[x>0])))
counts.top1000 <- counts[which(rank(-gene.var)<=1000),]
counts.pca <- prcomp(log(counts.top1000+1),
center = TRUE,
scale. = TRUE)
summary(counts.pca)$importance[,1:5]
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 24.55390 14.19966 7.67291 6.620296 5.769057
## Proportion of Variance 0.35908 0.12009 0.03506 0.026100 0.019820
## Cumulative Proportion 0.35908 0.47917 0.51423 0.540340 0.560160
plot(counts.pca, type="l", main="Top 10 PCs")
color_class <- rainbow(length(unique(cells$major_class)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2],
xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$major_class))], pch=20,
main="PCA plot of cells colored by derived major class")
color_class <- rainbow(length(unique(cells$layer_dissectoin)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2],
xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$layer_dissectoin))], pch=20,
main="PCA plot of cells colored by Dissection Layer")
rm(gene.var, counts.pca, counts.top100) # clean up workspace
## Warning in rm(gene.var, counts.pca, counts.top100): object 'counts.top100'
## not found
Apua’s Section …
…
detectionRate <- apply(counts, 2, function(x) sum(x > 0) / length(x))
hist(detectionRate)
rm(detectionRate) #clean up workspace
…
Keegan’s Section …
In this module, we will identify genes that are highly variable across the entire cell population. This will give us a subset of genes to focus on that is likely enriched for those that are driving heterogeneity among cellular subtypes.
For ease in downstream analysis with various R packages we’ll make use of today, we’ll convert the data.frames that currently (separately) house the counts and cell metadata into a Bioconductor object called an SCESet introduced by the scater package. This object is a container that can hold raw and normalized expression values, along with the metadata for both samples and genes, in one place.
library(scater)
library(scran)
rownames(cells) <- cells$long_name
# construct a SCESet that also contains the gene counts, ercc counts, and metadata
all.equal(colnames(counts), colnames(ercc)) # check that the cells are in the same order across the two datasets
## [1] TRUE
counts.all <- rbind(counts, ercc) # combine the two into one data.frame
eset <- newSCESet(countData = counts.all, phenoData = AnnotatedDataFrame(cells))
isSpike(eset) <- grepl("^ERCC", rownames(eset)) #designate which rows contain spikeins instead of genes (for HVG analysis)
rm(counts, counts.all) # remove counts and counts.all matrices to free up memory (the counts are now stored in the eset object)
Now that our SCESet has been created, we’ll also perform some basic preprocessing and normalization here to adjust for library size using the pool and deconvolve method in the scran package, as well as remove genes with very low expression (*** these steps may not be necessary if this was already carried out in Apua’s section… Can refer back to the section or move/remove if it is redundant ***).
# QC to compare the level of dropout in endogeneous genes to ERCC spike ins in raw data
eset <- calculateQCMetrics(eset, feature_controls=isSpike(eset))
plotQC(eset, type = "exprs-freq-vs-mean")
Here we apply a couple of filters to remove genes that are expressed at a uniformly low level across the population, and then apply our chosen normalization procedure.
# first, filter out genes that are almost always zero (at least 50 out of 1679 cells must have nonzero expression)
keep <- rowSums(counts(eset) > 0) >= 50
eset <- eset[keep,]
sum(keep)
## [1] 15341
# next, filter out genes with very low average nonzero expression across all cells (requres a mean of at least 5 counts across all cells)
keep <- rowMeans(counts(eset)) >= 5
eset <- eset[keep,]
sum(keep)
## [1] 13461
# normalize counts for library size using the pool & deconvolve method of Lun et al. (2016)
eset <- computeSumFactors(eset, sizes=c(20, 40, 60, 80))
summary(sizeFactors(eset))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1164 0.7428 0.9664 1.1410 1.2770 8.7680
# use the size factors calculated above to normalize the counts - these get placed in the 'exprs' slot
eset <- normalize.SCESet(eset)
Let’s check the correlation of our size factors with the library size. The strong positive correlation suggests that the size factors calculated by this normalization procedure are primarily adjusting for sequencing depth and overall capture rate.
plot(sizeFactors(eset), colSums(counts(eset))/1e6, log="xy",
ylab="Library Size (Total Counts in Millions)", xlab="Pooled Size Factor Estimate",
main="Normalization factor versus library size")
Next, we’ll perform some calculations to help us identify genes that have high variability. To do so, we have to take into account the relationship between mean expression level and variance of expression level. Namely, that what we observe in RNA-seq count data is that genes with higher expression have higher variance (also commonly referred to as ‘over-dispersion’ in the literature). Note that for data on the log-scale, the trend is reverse (variance of log-expression decreases for higher mean log-expression genes). When overdispersion is present, this is also a property of the Negative Binomial distribution, which is often used to model count data from RNA-seq experiments.
We use the trendVar function of the scran package to estimate the relationship between the mean and variance. This function fits a smooth type of curve to capture the overall trend in mean log-expression and variance log-expression, which will serve as an estimate of the baseline technical variability if we assume that the majority of the genes are constantly expressed (i.e. that there is no significant biological variability in the majority of genes). Once this relationship has been estimated, the decomposeVar (also in the scran package) essentially removes the estimated technical variability component from the total variability to calculate the estimated biological variability. This is what we aim to use to find the highly variable genes. We also visualize the mean and variance log-expression relationship, along with the estimated technical variabilty fit.
var.fit <- trendVar(eset, trend="loess", use.spikes=FALSE, span=0.2)
var.out <- decomposeVar(eset, var.fit)
# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)
The dip in the beginning of the plot is due to the very low expressed genes having very low variance since they are dominated by zeroes and very low counts. We can see this if we remake the plot but color the points by the proportion of cells with a zero value.
# function to create a gradient of colors
color.gradient <- function(x, colors=c("orange2", "blue"), colsteps=500) {
return( colorRampPalette(colors) (colsteps) [ findInterval(x, seq(min(x),max(x), length.out=colsteps)) ] )
}
pzero <- apply(counts(eset), 1, function(x) sum(x > 0) / length(x))
# replot with color by proportion of zero
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression", col=color.gradient(pzero))
colscale <- c(0,0.25,0.5,0.75,1)
legend(11.5,40, title="Proportion cells zero", legend=paste0(1-round(quantile(pzero, colscale),2)),
col=color.gradient(quantile(pzero, colscale)), pch=16)
rm(pzero) # clean up workspace
Recall the assumption we made in estimating the mean-variance relationship above: there is no significant biological variability in the majority of genes. Is this true in our experiment? How might this assumption be evaluated?
If this assumption is suspect, it is ideal to independently estimate this relationship for so-called ‘spike-in’ data (where a small number of artifical transcripts have been added at known concentrations such that any variability is assured to be technical). Fortunately, in this experiment we have ERCC control spike-ins, so we’ll repeat the steps above to estimate the mean-variance relationship only on the spike-ins. We do this by using the use.spikes=TRUE option in the trendVar function. We also increase the smoothing span a bit since we are using fewer points for estimation.
var.fit.spike <- trendVar(eset, trend="loess", use.spikes=TRUE, span=0.3)
var.out.spike <- decomposeVar(eset, var.fit.spike)
# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out.spike$mean, var.out.spike$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)
o <- order(var.out.spike$mean)
lines(var.out.spike$mean[o], var.out.spike$tech[o], col="red", lwd=2)
points(var.fit.spike$mean, var.fit.spike$var, col="red", pch=16)
How similar are the two trend estimations? What does this suggest about the assumption that most genes do not have significant biological variability?
Finally, we extract the top 1000 genes with highest biological variability to use in downstream analyses. We’ll also examine the distributions of the top 25 highly variable genes. Here we will use the estimate of biological variability we obtained by fitting the spike-ins above. However, note that spike-ins are not always available, they can be challenging to incorporate into the protocol, and won’t necessarily capture all of the technical variability depending on where in the protocol they are added.
# extract and examine the top 1000 genes by biological variance
top.hvg <- order(var.out.spike$bio, decreasing=TRUE)[1:1000]
head(var.out.spike[top.hvg,])
## mean total bio tech
## Gad1 6.587978 45.91174 31.96631 13.945424
## Slc6a1 7.176197 42.18583 29.90373 12.282101
## Cnr1 8.608585 32.69374 24.90063 7.793106
## Enc1 9.133358 29.08431 22.97221 6.112099
## Rgs4 8.258177 30.38597 21.49162 8.894354
## Synpr 5.588791 36.68112 20.72181 15.959308
# construct a new eset object that only contains the highly variable genes for downstream analysis
eset.hvg <- eset[top.hvg,]
# plot distribution of the top 25 highly variable genes
top25 <- top.hvg[1:25]
boxplot(t(exprs(eset)[top25,]), las=2, ylab="Normalized log-expression", col="dodgerblue", main="Top 25 Highly Variable Genes")
Note that the biological variability estimates provided by seem to be rather robust to outlier cells, as none of the genes with highest variability seem to be driven by outliers.
Let’s also look again at the mean log-expression versus variance log-expression plot we generated above, but this time highlight the locations of top 1000 and top 25 highly variable genes.
# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression")
o <- order(var.out.spike$mean)
lines(var.out.spike$mean[o], var.out.spike$tech[o], col="red", lwd=2)
points(var.out$mean[top.hvg], var.out$total[top.hvg], col="darkorange2", cex=0.6, pch=16)
points(var.out$mean[top25], var.out$total[top25], col="red4", cex=0.7, pch=15)
legend(11.5, 40, legend=c("Top 25 HVGs", "Top 1000 HVGs"), pch = c(15, 16),
col= c("red4", "darkorange2"), cex=1.1)
rm(var.out, var.out.spike, var.fit, var.fit.spike) # clean up workspace
Finally, we’ll create a heatmap plot to visualize how the overall expression of the top 25 HVGs are associated with the major cell classes.
# extract matrix of hvg expression for plotting
m <- exprs(eset)[top25,]
# plot heatmap
library(RColorBrewer)
heatmap(m/apply(m,1,max),zlim=c(0,1),labCol=NA, col=brewer.pal(9,"YlOrRd"),
scale="none",ColSideColors=rainbow(5)[as.numeric(factor(phenoData(eset)$major_class))])
par(lend = 1) # square line ends for the color legend
legend("topleft", inset=-0.04,
legend = unique(phenoData(eset)$major_class),
col = rainbow(5)[as.numeric(as.factor(unique(phenoData(eset)$major_class)))],
lty= 1, lwd = 5, cex=0.6
)
rm(m) # clean up workspace
In this module we will identify differentially expressed genes between neuronal subtypes, as well as between neurons and non-neuronal cell types. As we have learned in this workshop, though scRNA-seq data shares many characteristics with bulk RNA-seq data, there are major differences which need to be accommodated in order to properly analyze it and fully exploit its advantages (namely the presence of dropouts, or zeroes, and increased heterogeneity from a combination of technical and biological sources).
First we will explore the method SCDE which compares expression magnitude between groups of cells while adjusting for drop-out and amplification biases in expression magnitude by fitting error models for individual cells.
After loading the SCDE package, we’ll create a vector of indices to indicate which cells we will analyze. In the interest of minimizing computation time during this workshop, we’ll only assess the top 1000 highly variable genes for differential expression (these are contained in the eset.hvg SCESet object). In addition, some of the SCDE output has been precomputed for you since the method is computationally intensive with a large number of cells (over 1000). To complete the rest of the module, you’ll need to download this results object from GitHub if you are not using one of the server instances (where it is preloaded) - see the instructions below.
We’ll first create a vector that indicates which of the cells have been classified as neurons. This will be used to subset the SCESet to contain only these cells. Then we’ll reduce this indicator vector to only contain a random sample of 100 cells of each neuron subtype. We’ll also create a group factor variable that indicates which type of neuron each cell is classified as (inhibitory or exitatory). Finally, since SCDE requires raw (unnormalized) integer count data as input, we’ll extract the raw counts matrix from the subsetted SCESet object and change their class from numeric to integer.
library(scde)
# find DE genes between excitatory and inhibitory neuronal subtypes
# first, find which cells are neuron (inhibitory or excitatory)
which.neur <- which(pData(eset.hvg)$major_class %in% c("Inhibitory", "Excitatory"))
# next, construct the group factor labeling for neuron subtype using the index of neuron cells `which.neur' to subset the SCESet object
group <- factor(pData(eset.hvg[,which.neur])$major_class)
# re-create the group factor for the subsampled cell indices
group <- factor(pData(eset.hvg[,which.neur])$major_class)
names(group) <- sampleNames(eset.hvg[,which.neur])
# save counts matrix as integer class; note SCDE requires raw (unnormalized) counts as input.
cts <- apply(counts(eset.hvg[,which.neur]),2,function(x) {storage.mode(x) <- 'integer'; x})
With the SCDE inputs prepared, we’re ready to fit the SCDE models for these genes. To do so, we first use the scde.error.models function to fit the cell-specific error models. These error models are then used as input to fit the models for the prior distributions of gene expression magnitude using the scde.expression.prior function. Then, both the error models and prior are used as input to the main function scde.expression.difference function, which performs a test of differential expression for each gene. The output returns a table that contains the value of the test statistic for each gene, so we’ll carry out an additional step to obtain the p-values that correspond to the test statistics. Note that the columns starting with a ‘c’ correspond to those that have been corrected for multiple comparisons.
Note that these model-fitting step can take quite some time, since an individual error model is fit for each cell in the dataset, and pairs of cells are compared to one another in order to estimate the parameters of the fit. For the purposes of this workshop, the error models object has been precomputed (If you are using the RStudio Server instance, simply load the scde.results.RData file. If you are using your own machine, you must first download this object from the GitHub page for this vignette - https://raw.githubusercontent.com/kdkorthauer/FestivalWorkshopVignettes/master/scde.results.RData and then save it to your R session’s current working directory).
# fit error models using the scde.error.models function (the following steps have already been run in the interest of time - it is computationally intensive)
########## err.mod <- scde.error.models(counts = cts, groups = group, n.cores = 10, min.nonfailed = 30,
########## verbose=1, save.crossfit.plots=FALSE, threshold.segmentation = TRUE,
########## min.size.entries = 500, save.model.plots = FALSE, linear.fit=FALSE,
########## min.pairs.per.cell=20, max.pairs=10000)
# estimate the prior for gene expression
########## prior.mod <- scde.expression.prior(models = err.mod, counts = cts, length.out = 400, show.plot = FALSE, max.quantile=0.9999)
# test for differential expression for all genes using the scde.expression.difference function
########## exp.diff <- scde.expression.difference(models = err.mod, counts = cts, prior = prior.mod, groups = group, n.cores = 1)
Again, the preceding steps are commented out since they have already been run for you. Instead, load the results object that contains err.mod, prior.mod, and exp.diff. We’ll add a column to the main results table that displays the p-values associated with the test statistics (both raw and multiplicity-adjusted).
# load the pre-computed results objects err.mod, prior.mod, and exp.diff (all contained
# in the file "scde.results.RData"). Assumes the file is saved to the current working
# directory
load("scde.results.RData")
# view the top of the main results table
head(exp.diff)
## lb mle ub ce Z cZ
## Gad1 0.58992655 0.9075793 1.1798531 0.58992655 7.153782 6.504284
## Slc6a1 0.45378966 0.8168214 1.1344741 0.45378966 5.496958 4.860499
## Cnr1 0.04537897 0.7260634 1.3613690 0.04537897 2.090673 1.461324
## Enc1 -0.86220035 -0.4991686 -0.1815159 -0.18151586 -2.868074 -2.220649
## Rgs4 -1.04371621 -0.7260634 -0.4537897 -0.45378966 -5.557564 -4.922400
## Synpr 0.77144242 1.1798531 1.6336428 0.77144242 7.151443 6.504284
# add a column with p-values and multiplicity-corrected pvalues
exp.diff$Pval <- 2*pnorm(-abs(exp.diff$Z))
exp.diff$cPval <- p.adjust(exp.diff$Pval, method="fdr")
We can explore these results to see how many genes are differentially expressed at the 0.05 level after adjusting for mutliple comparisons, and save these results in a .csv file (viewable in Microsoft Excel or a simple text editor).
# count how many genes have an fdr-adjusted p-value less than 0.05
sum(exp.diff$cPval < 0.05)
## [1] 184
# reorder the genes by pvalue
exp.diff <- exp.diff[order(exp.diff$cPval),]
head(exp.diff)
## lb mle ub ce Z cZ
## Gad1 0.5899266 0.9075793 1.179853 0.5899266 7.153782 6.504284
## Synpr 0.7714424 1.1798531 1.633643 0.7714424 7.151443 6.504284
## Zcchc12 0.7714424 1.1798531 1.542885 0.7714424 7.160813 6.504284
## Serpini1 0.6806845 0.9529583 1.225232 0.6806845 7.160813 6.504284
## Rab3b 0.8622003 1.2252321 1.497506 0.8622003 7.160813 6.504284
## Sst 1.2706110 1.9512955 2.768117 1.2706110 7.160813 6.504284
## Pval cPval
## Gad1 8.441938e-13 7.806413e-11
## Synpr 8.587054e-13 7.806413e-11
## Zcchc12 8.020003e-13 7.806413e-11
## Serpini1 8.020006e-13 7.806413e-11
## Rab3b 8.020000e-13 7.806413e-11
## Sst 8.020000e-13 7.806413e-11
# create a .csv file that lists all the DE genes with fdr-adjusted p-value less than 0.05
write.csv(exp.diff[exp.diff$cPval < 0.05, ], file = "scdeResults_DEgenes.csv", row.names = TRUE, quote = FALSE)
To get more insight into the models fit by SCDE, we can use the related scde.test.expression.difference function to visualize the results for a particular gene. For example, we can view the cell-specific posterior distributions for the two different neuronal subtypes for a DE gene:
# visualize the results for a particular DE gene
scde.test.gene.expression.difference("Gad1", models = err.mod, counts = cts, prior = prior.mod)
## lb mle ub ce Z cZ
## Gad1 -10.48254 -10.02875 -9.620341 -9.620341 -7.160847 -7.160847
In contrast, we can look at another gene that is not differentially expressed and see that the cell-specific posterior distributions do not look globally different between excitatory and inhibitory neurons.
scde.test.gene.expression.difference("Rgs17", models = err.mod, counts = cts, prior = prior.mod)